EDA of Red Wine Quality

by Ray Wong

Citation

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

This report uses the dataset from the paper cited above to explore which variables are correlated with the quality of red wine.

Univariate Plots Section

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Our dataset has 1599 observations of 13 variables. The variable ‘X’ appears to be a row index for each observation, so we will ignore it. The ‘quality’ variable is an integer; all other variables are numeric. We will plot a histogram of each variable to see its distribution and any outliers.
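A minimal sketch of the inspection steps above, in base R. The file name `wineQualityReds.csv` is an assumption (the report does not name the file); `pf` is the data frame name that appears in the correlation output later in this report. So that the snippet runs on its own, it rebuilds the first ten rows of four columns from the `str()` output above instead of reading the file.

```r
# In the real analysis the data would be read from disk, e.g.
# pf <- read.csv("wineQualityReds.csv")   # file name is an assumption

# Stand-in: first ten rows of four columns, copied from the str() output above.
pf <- data.frame(
  X                = 1:10,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(pf)      # number of rows and columns
str(pf)      # type of each variable
summary(pf)  # min, quartiles, mean and max per variable

# X is only a row index, so drop it before analysing.
wine <- pf[, setdiff(names(pf), "X")]
names(wine)
```

Dropping `X` by name rather than by position keeps the step valid even if the column order changes.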

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The plot shows that most wines have a quality of 5 or 6. Only a few have a quality of 3 or 8, the lowest and highest scores in the data. This matches the intuition that very poor and very good wines are rare.
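As a quick check on "most wines are 5 or 6", the shares can be computed directly from the counts in the table above. This is a sketch that uses only those six counts; the variable names are illustrative.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

n_total <- sum(quality_counts)
share   <- round(100 * quality_counts / n_total, 1)   # percent per score
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / n_total

# Mean quality recovered from the counts; should agree with summary().
scores <- as.numeric(names(quality_counts))
mean_quality <- sum(scores * quality_counts) / n_total

print(share)
print(round(share_5_or_6, 3))
print(round(mean_quality, 3))

# Bar chart of the distribution.
barplot(quality_counts, xlab = "quality", ylab = "count")
```

With these counts, about 82% of the wines score 5 or 6, and the recomputed mean agrees with the 5.636 reported in the summary.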

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## 
##  4.6  4.7  4.9    5  5.1  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9    6  6.1 
##    1    1    1    6    4    6    4    5    1   14    2    4    9   13   16 
##  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9    7  7.1  7.2  7.3  7.4  7.5  7.6 
##   20   14   25   17   37   28   46   38   50   57   67   44   44   52   46 
##  7.7  7.8  7.9    8  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9    9  9.1 
##   49   53   42   42   26   45   40   26   19   27   24   34   33   26   29 
##  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9   10 10.1 10.2 10.3 10.4 10.5 10.6 
##   16   22   17   14   17    9   15   26   23   10   19   11   21   12   14 
## 10.7 10.8 10.9   11 11.1 11.2 11.3 11.4 11.5 11.6 11.7 11.8 11.9   12 12.1 
##   10   10    8    3    9    5    7    5   13   12    3    3   12    7    1 
## 12.2 12.3 12.4 12.5 12.6 12.7 12.8 12.9   13 13.2 13.3 13.4 13.5 13.7 13.8 
##    4    5    4    7    4    4    5    2    3    3    3    1    1    2    1 
##   14 14.3   15 15.5 15.6 15.9 
##    1    1    2    2    2    1

The distribution of fixed acidity is roughly bell-shaped with a slight right skew (the mean, 8.32, is above the median, 7.90), and it peaks at a fixed acidity of around 7.2.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## 
##  0.12  0.16  0.18  0.19   0.2  0.21  0.22  0.23  0.24  0.25  0.26  0.27 
##     3     2    10     2     3     6     6     5    13     7    16    14 
##  0.28  0.29 0.295   0.3 0.305  0.31 0.315  0.32  0.33  0.34  0.35  0.36 
##    23    16     1    16     2    30     2    23    20    30    22    38 
## 0.365  0.37  0.38  0.39 0.395   0.4  0.41 0.415  0.42  0.43  0.44  0.45 
##     2    24    35    35     2    37    33     3    31    43    23    22 
##  0.46  0.47 0.475  0.48  0.49   0.5  0.51  0.52  0.53  0.54 0.545  0.55 
##    31    21     2    24    35    46    24    33    29    31     5    20 
##  0.56 0.565  0.57 0.575  0.58 0.585  0.59 0.595   0.6 0.605  0.61 0.615 
##    34     1    28     3    38     3    39     1    47     3    27     6 
##  0.62 0.625  0.63 0.635  0.64 0.645  0.65 0.655  0.66 0.665  0.67 0.675 
##    24     3    29     9    27    12    16     7    26     3    23     3 
##  0.68 0.685  0.69 0.695   0.7 0.705  0.71 0.715  0.72 0.725  0.73 0.735 
##    12    11    23     7    10     6     3    12     5     9     6     8 
##  0.74 0.745  0.75 0.755  0.76 0.765  0.77 0.775  0.78 0.785  0.79 0.795 
##    11     5     6     3     5     5     6     4    10     8     2     2 
##   0.8 0.805  0.81 0.815  0.82 0.825  0.83 0.835  0.84 0.845  0.85 0.855 
##     3     1     2     3     5     1     4     4     8     1     2     3 
##  0.86 0.865  0.87 0.875  0.88 0.885  0.89 0.895   0.9  0.91 0.915  0.92 
##     2     1     4     2     5     5     1     1     3     3     4     1 
## 0.935  0.95 0.955  0.96 0.965 0.975  0.98     1 1.005  1.01  1.02 1.025 
##     2     1     1     3     3     1     3     3     1     1     4     1 
## 1.035  1.04  1.07  1.09 1.115  1.13  1.18 1.185  1.24  1.33  1.58 
##     1     3     1     1     1     1     1     1     1     2     1

Volatile acidity ranges from 0.12 to 1.58, with a few outliers between 1.1 and 1.6.
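One common way to make "outlier" concrete is Tukey's 1.5 × IQR rule. Applying it here is an assumption for illustration; the report appears to judge outliers by eye from the histogram. The quartiles come from the summary above, and `tail_values` is a hand-picked sample from the upper end of the frequency table.

```r
# Quartiles of volatile acidity, from the summary above.
q1 <- 0.39
q3 <- 0.64

iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# A few values from the upper end of the frequency table above.
tail_values <- c(0.98, 1.02, 1.07, 1.13, 1.24, 1.33, 1.58)
flagged <- tail_values[tail_values > upper_fence]

print(c(lower = lower_fence, upper = upper_fence))
print(flagged)
```

Under this rule the upper fence is about 1.015, slightly below the 1.1 cut-off read from the histogram, so the two approaches flag largely the same wines.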

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

A citric acid value of 0 is more common than any other value (132 observations), and there is an outlier at a citric acid value of 1.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
## 
##  0.9  1.2  1.3  1.4  1.5  1.6 1.65  1.7 1.75  1.8  1.9    2 2.05  2.1 2.15 
##    2    8    5   35   30   58    2   76    2  129  117  156    2  128    2 
##  2.2 2.25  2.3 2.35  2.4  2.5 2.55  2.6 2.65  2.7  2.8 2.85  2.9 2.95    3 
##  131    1  109    1   86   84    1   79    1   39   49    1   24    1   25 
##  3.1  3.2  3.3  3.4 3.45  3.5  3.6 3.65  3.7 3.75  3.8  3.9    4  4.1  4.2 
##    7   15   11   15    1    2    8    1    4    1    8    6   11    6    5 
## 4.25  4.3  4.4  4.5  4.6 4.65  4.7  4.8    5  5.1 5.15  5.2  5.4  5.5  5.6 
##    1    8    4    4    6    2    1    3    1    5    1    3    1    8    6 
##  5.7  5.8  5.9    6  6.1  6.2  6.3  6.4 6.55  6.6  6.7    7  7.2  7.3  7.5 
##    1    4    3    4    4    3    2    3    2    2    2    1    1    1    1 
##  7.8  7.9  8.1  8.3  8.6  8.8  8.9    9 10.7   11 12.9 13.4 13.8 13.9 15.4 
##    2    3    2    3    1    2    1    1    1    2    1    1    2    1    2 
## 15.5 
##    1

The histogram is long-tailed. Most residual sugar values fall between 1.5 and 3.0. We will log-transform residual sugar and plot the histogram of the transformed variable to get a better view.

Even after the log transformation, the residual sugar histogram is still right-skewed.
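To illustrate what the log scale does to a long right tail, the sketch below applies `log10` to the five-number summary of residual sugar reported above and compares the length of each tail with the interquartile range. Using base-10 logs is an assumption (the report does not say which base was used), and `tail_ratio` is a helper introduced here for illustration.

```r
# Five-number summary of residual sugar, from the output above.
sugar <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)
log_sugar <- log10(sugar)

# Distance from the median to each extreme, in units of the IQR.
tail_ratio <- function(x) {
  iqr <- x[["q3"]] - x[["q1"]]
  c(lower = (x[["median"]] - x[["min"]]) / iqr,
    upper = (x[["max"]] - x[["median"]]) / iqr)
}

print(round(tail_ratio(sugar), 2))      # raw scale
print(round(tail_ratio(log_sugar), 2))  # log10 scale
```

The upper tail shrinks from 19 IQRs on the raw scale to roughly 6 on the log scale, but it is still longer than the lower tail, which fits a histogram that stays right-skewed after the transformation.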

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## 
## 0.012 0.034 0.038 0.039 0.041 0.042 0.043 0.044 0.045 0.046 0.047 0.048 
##     2     1     2     4     4     3     1     5     4     4     4     8 
## 0.049  0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058 0.059  0.06 
##     8    12     1    10     5    13     8     9    10    14    17    16 
## 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069  0.07 0.071 0.072 
##    11    24    22    20    23    32    27    30    21    35    47    24 
## 0.073 0.074 0.075 0.076 0.077 0.078 0.079  0.08 0.081 0.082 0.083 0.084 
##    35    55    45    51    47    51    43    66    40    46    35    49 
## 0.085 0.086 0.087 0.088 0.089  0.09 0.091 0.092 0.093 0.094 0.095 0.096 
##    25    31    25    32    25    21    19    22    21    19    23    18 
## 0.097 0.098 0.099   0.1 0.101 0.102 0.103 0.104 0.105 0.106 0.107 0.108 
##    18    12     8    13     5    10     7    16     6     8     9     1 
## 0.109  0.11 0.111 0.112 0.113 0.114 0.115 0.116 0.117 0.118 0.119  0.12 
##     3     8     7     6     1    11     5     2     4     8     3     3 
## 0.121 0.122 0.123 0.124 0.125 0.126 0.127 0.128 0.132 0.136 0.137 0.143 
##     2     7     6     3     1     1     1     1     4     1     1     1 
## 0.145 0.146 0.147 0.148 0.152 0.153 0.157 0.159 0.161 0.165 0.166 0.168 
##     1     1     1     1     2     1     3     1     1     1     3     1 
## 0.169  0.17 0.171 0.172 0.174 0.176 0.178 0.186  0.19 0.194   0.2 0.205 
##     1     1     2     1     1     1     2     1     1     1     1     2 
## 0.213 0.214 0.216 0.222 0.226  0.23 0.235 0.236 0.241 0.243  0.25 0.263 
##     1     3     1     1     2     1     1     1     1     1     1     1 
## 0.267  0.27 0.332 0.337 0.341 0.343 0.358  0.36 0.368 0.369 0.387 0.401 
##     1     1     1     1     1     1     1     1     1     1     1     1 
## 0.403 0.413 0.414 0.415 0.422 0.464 0.467  0.61 0.611 
##     1     1     2     3     1     1     1     1     1

The histogram is right-skewed with a long tail. Most chlorides values are between 0.04 and 0.12.

After the log transformation, the chlorides histogram is almost normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
## 
##    1    2    3    4    5  5.5    6    7    8    9   10   11   12   13   14 
##    3    1   49   41  104    1  138   71   56   62   79   59   75   57   50 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
##   78   61   60   46   39   30   41   22   32   34   24   32   29   23   23 
##   30   31   32   33   34   35   36   37 37.5   38   39   40 40.5   41   42 
##   16   20   22   11   18   15   11    3    2    9    5    6    1    7    3 
##   43   45   46   47   48   50   51   52   53   54   55   57   66   68   72 
##    3    3    1    1    4    2    4    3    1    1    2    1    1    2    1

The histogram of free sulfur dioxide peaks at around 5 to 6 and then drops off.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## 
##    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 
##    3    4   14   14   27   26   29   28   33   35   26   27   35   29   33 
##   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35 
##   25   25   34   36   27   24   30   43   20   14   32   20   17   20   26 
##   36   37   38   39   40   41   42   43   44   45   46   47   48   49   50 
##   12   26   31   16   17   14   26   18   23   20   17   24   21   21   11 
##   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
##   11   15   14   20   13   10    6   14    9   18    9    9   13   10   17 
##   66   67   68   69   70   71   72   73   74   75   76   77 77.5   78   79 
##    9   12   10    8    8    7   10    7    8    5    3    8    2    4    5 
##   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94 
##    4    6    4    2    6    9   10    6   14    9    5    7    8    2    8 
##   95   96   98   99  100  101  102  103  104  105  106  108  109  110  111 
##    4    5    7    6    3    4    6    2    5    5    6    3    4    6    3 
##  112  113  114  115  116  119  120  121  122  124  125  126  127  128  129 
##    3    4    2    2    1    7    2    4    3    3    2    1    2    2    3 
##  130  131  133  134  135  136  139  140  141  142  143  144  145  147  148 
##    1    3    3    2    2    2    1    1    3    1    2    3    3    3    2 
##  149  151  152  153  155  160  165  278  289 
##    1    2    1    1    1    1    1    1    1

The histogram is right-skewed, and there are two outliers at 278 and 289.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037
## 
## 0.99007  0.9902 0.99064  0.9908 0.99084  0.9912  0.9915 0.99154 0.99157 
##       2       1       2       1       1       1       1       1       1 
##  0.9916 0.99162  0.9917 0.99182 0.99191  0.9921  0.9922 0.99235 0.99236 
##       2       1       1       2       1       1       2       1       1 
##  0.9924 0.99242 0.99252 0.99256 0.99258 0.99264  0.9927  0.9928 0.99286 
##       3       2       1       1       3       1       1       2       1 
##  0.9929 0.99292 0.99294 0.99306 0.99314 0.99316 0.99318  0.9932 0.99322 
##       1       1       2       1       1       2       1       1       1 
## 0.99323 0.99328  0.9933 0.99331 0.99332 0.99334 0.99336  0.9934 0.99341 
##       1       1       1       2       1       1       1       4       1 
## 0.99344 0.99346 0.99348  0.9935 0.99352 0.99354 0.99356 0.99357 0.99358 
##       1       3       1       1       2       2       4       1       3 
##  0.9936 0.99362 0.99364  0.9937 0.99371 0.99374 0.99376 0.99378 0.99379 
##       2       2       1       2       2       2       3       3       1 
##  0.9938 0.99384 0.99385 0.99386 0.99387 0.99388 0.99392 0.99394 0.99395 
##       1       1       1       1       1       2       2       1       1 
## 0.99396 0.99397   0.994 0.99402 0.99408  0.9941 0.99414 0.99416 0.99417 
##       3       1       2       4       3       1       2       1       1 
## 0.99418 0.99419  0.9942 0.99425 0.99426 0.99428  0.9943 0.99434 0.99437 
##       2       2       3       1       1       1       2       1       1 
## 0.99438 0.99439  0.9944 0.99444 0.99448 0.99451 0.99454 0.99456 0.99458 
##       5       1       3       4       4       1       1       1       4 
## 0.99459  0.9946 0.99462 0.99464 0.99467 0.99468  0.9947 0.99471 0.99472 
##       1       5       2       2       2       1       6       3       3 
## 0.99473 0.99474 0.99476 0.99478 0.99479  0.9948 0.99483 0.99484 0.99486 
##       1       1       3       2       1       9       1       3       1 
## 0.99488 0.99489  0.9949 0.99491 0.99492 0.99494 0.99495 0.99496 0.99498 
##       4       3       4       1       2       4       2       1       5 
## 0.99499   0.995 0.99501 0.99502 0.99504 0.99506 0.99508 0.99509  0.9951 
##       1      10       1       2       2       1       3       1       4 
## 0.99512 0.99514 0.99516 0.99517 0.99518 0.99519  0.9952 0.99521 0.99522 
##       2       5       6       1       3       1       9       1       4 
## 0.99523 0.99524 0.99525 0.99526 0.99528 0.99529  0.9953 0.99531 0.99532 
##       1       4       2       2       3       1       4       2       1 
## 0.99533 0.99534 0.99536 0.99538  0.9954 0.99541 0.99542 0.99543 0.99544 
##       1       6       2      11       4       1       1       2       1 
## 0.99545 0.99546 0.99547 0.99549  0.9955 0.99551 0.99552 0.99553 0.99554 
##       3       7       2       2      14       3       5       1       3 
## 0.99555 0.99556 0.99557 0.99558  0.9956 0.99562 0.99564 0.99565 0.99566 
##       1       2       3       3      14       4       2       3       4 
## 0.99568 0.99569  0.9957 0.99572 0.99573 0.99574 0.99575 0.99576 0.99577 
##       4       1       6       9       1       2       2       5       3 
## 0.99578  0.9958 0.99581 0.99582 0.99584 0.99585 0.99586 0.99587 0.99588 
##       3      14       1       1       2       3       6       2       4 
## 0.99589  0.9959 0.99592 0.99593 0.99594 0.99596 0.99598 0.99599   0.996 
##       1      13       4       2       1       2       2       2      13 
## 0.99603 0.99604 0.99605 0.99606 0.99608 0.99609  0.9961 0.99612 0.99613 
##       2       3       3       2       2       1      10       6       4 
## 0.99614 0.99615 0.99616 0.99617 0.99619  0.9962 0.99621 0.99622 0.99623 
##       2       5       7       1       1      28       1       5       2 
## 0.99624 0.99625 0.99627 0.99628 0.99629  0.9963 0.99631 0.99632 0.99633 
##       3       3       3       3       2      15       1       4       4 
## 0.99634 0.99635 0.99636 0.99638 0.99639  0.9964 0.99641 0.99642 0.99643 
##       3       1       5       5       2      25       1       3       1 
## 0.99645 0.99646 0.99647 0.99648 0.99649  0.9965 0.99651 0.99652 0.99654 
##       1       1       2       3       1      11       1       6       2 
## 0.99655 0.99656 0.99658 0.99659  0.9966 0.99661 0.99664 0.99665 0.99666 
##       6       5       1       2      23       1       3       1       3 
## 0.99667 0.99668 0.99669  0.9967 0.99672 0.99674 0.99675 0.99676 0.99677 
##       1       4       2      13       5       2       5       3       2 
## 0.99678  0.9968 0.99682 0.99683 0.99684 0.99685 0.99686 0.99688 0.99689 
##       1      35       2       2       1       8       3       2       4 
##  0.9969 0.99692 0.99693 0.99694 0.99695 0.99697 0.99698 0.99699   0.997 
##      18       4       2       3       1       1       1       1      24 
## 0.99701 0.99702 0.99704 0.99705 0.99706 0.99708 0.99709  0.9971 0.99712 
##       2       4       3       1       2       4       1      13       4 
## 0.99713 0.99714 0.99716 0.99717 0.99718 0.99719  0.9972 0.99721 0.99722 
##       2       2       2       1       3       1      36       1       1 
## 0.99724 0.99725 0.99726 0.99727 0.99728 0.99729  0.9973 0.99732 0.99733 
##       4       1       1       1       3       1      18       3       1 
## 0.99734 0.99735 0.99736 0.99738 0.99739  0.9974 0.99743 0.99744 0.99745 
##       4       6       5       4       1      22       2       2       9 
## 0.99746 0.99747 0.99748  0.9975 0.99752 0.99754 0.99756 0.99758  0.9976 
##       7       2       3       7       1       1       1       1      35 
## 0.99761 0.99764 0.99765 0.99768 0.99769  0.9977 0.99772 0.99774 0.99779 
##       1       1       1       3       2       4       1       5       1 
##  0.9978 0.99782 0.99783 0.99784 0.99785 0.99786 0.99787 0.99788  0.9979 
##      26       2       2       1       1       4       3       2      14 
## 0.99791 0.99796 0.99798   0.998 0.99801 0.99803 0.99808  0.9981 0.99814 
##       1       1       2      29       2       3       1      10       2 
## 0.99815 0.99817 0.99818  0.9982 0.99822 0.99823 0.99824 0.99828  0.9983 
##       2       2       3      23       1       1       3       2       9 
## 0.99832 0.99834 0.99836  0.9984 0.99842 0.99845  0.9985 0.99852 0.99854 
##       1       1       2      20       2       1       3       1       1 
## 0.99855 0.99859  0.9986 0.99864 0.99865  0.9987 0.99878  0.9988 0.99888 
##       2       1      19       1       2      12       1      20       2 
##  0.9989 0.99892   0.999 0.99901  0.9991 0.99914 0.99915 0.99918  0.9992 
##       2       3       8       1      10       3       1       1       7 
## 0.99922 0.99925  0.9993 0.99935 0.99938 0.99939  0.9994  0.9995  0.9996 
##       1       1       4       1       1       1      24       1      12 
## 0.99965  0.9997 0.99974 0.99975 0.99976  0.9998  0.9999       1 1.00005 
##       1       8       1       1       1      10       1      10       2 
##  1.0001 1.00012 1.00015  1.0002 1.00024 1.00025  1.0003  1.0004  1.0006 
##       4       1       2      10       1       1       2       9       6 
##  1.0008   1.001  1.0014  1.0015  1.0018  1.0021  1.0022 1.00242  1.0026 
##       3       6       6       2       1       2       2       2       2 
## 1.00289 1.00315  1.0032 1.00369 
##       1       3       1       2

The density histogram is approximately normally distributed, with a peak at around 0.997.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
## 
## 2.74 2.86 2.87 2.88 2.89  2.9 2.92 2.93 2.94 2.95 2.98 2.99    3 3.01 3.02 
##    1    1    1    2    4    1    4    3    4    1    5    2    6    5    8 
## 3.03 3.04 3.05 3.06 3.07 3.08 3.09  3.1 3.11 3.12 3.13 3.14 3.15 3.16 3.17 
##    6   10    8   10   11   11   11   19    9   20   13   21   34   36   27 
## 3.18 3.19  3.2 3.21 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29  3.3 3.31 3.32 
##   30   25   39   36   39   32   29   26   53   35   42   46   57   39   45 
## 3.33 3.34 3.35 3.36 3.37 3.38 3.39  3.4 3.41 3.42 3.43 3.44 3.45 3.46 3.47 
##   37   43   39   56   37   48   48   37   34   33   17   29   20   22   21 
## 3.48 3.49  3.5 3.51 3.52 3.53 3.54 3.55 3.56 3.57 3.58 3.59  3.6 3.61 3.62 
##   19   10   14   15   18   17   16    8   11   10   10    8    7    8    4 
## 3.63 3.66 3.67 3.68 3.69  3.7 3.71 3.72 3.74 3.75 3.78 3.85  3.9 4.01 
##    3    4    3    5    4    1    4    3    1    1    2    1    2    2

The pH histogram is approximately normally distributed, with values between 2.74 and 4.01.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
## 
## 0.33 0.37 0.39  0.4 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 
##    1    2    6    4    5    8   16   12   18   19   29   31   27   26   47 
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 
##   51   68   50   60   55   68   51   69   45   61   48   46   41   42   36 
## 0.68 0.69  0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79  0.8 0.81 0.82 
##   35   23   33   26   28   26   26   20   25   26   23   18   19   15   22 
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89  0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 
##   15   13   14   13   13    7    7    8    8    5   10    4    2    3    6 
## 0.98 0.99    1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09  1.1 1.11 1.12 
##    2    3    1    1    3    2    2    3    4    2    3    1    2    1    1 
## 1.13 1.14 1.15 1.16 1.17 1.18  1.2 1.22 1.26 1.28 1.31 1.33 1.34 1.36 1.56 
##    2    2    1    1    5    3    1    1    1    2    1    1    1    3    1 
## 1.59 1.61 1.62 1.95 1.98    2 
##    1    1    1    2    1    1

The histogram is right-skewed and long-tailed, with some outliers between 1.95 and 2.

In the log-transformed sulphates histogram, the rises and drops in the range 0.5 to 0.8 are still visible.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## 
##              8.4              8.5              8.7              8.8 
##                2                1                2                2 
##                9             9.05              9.1              9.2 
##               30                1               23               72 
## 9.23333333333333             9.25              9.3              9.4 
##                1                1               59              103 
##              9.5             9.55 9.56666666666667              9.6 
##              139                2                1               59 
##              9.7              9.8              9.9             9.95 
##               54               78               49                1 
##               10 10.0333333333333             10.1             10.2 
##               67                2               47               46 
##             10.3             10.4             10.5            10.55 
##               33               41               67                2 
##             10.6             10.7            10.75             10.8 
##               28               27                1               42 
##             10.9               11 11.0666666666667             11.1 
##               49               59                1               27 
##             11.2             11.3             11.4             11.5 
##               36               32               32               30 
##             11.6             11.7             11.8             11.9 
##               15               23               29               20 
##            11.95               12             12.1             12.2 
##                1               21               13               12 
##             12.3             12.4             12.5             12.6 
##               12               13               21                6 
##             12.7             12.8             12.9               13 
##                9               17                9                6 
##             13.1             13.2             13.3             13.4 
##                2                1                3                3 
##             13.5 13.5666666666667             13.6               14 
##                1                1                4                7 
##             14.9 
##                1

The alcohol histogram peaks at around 9.4 to 9.5.

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality), plus the index column ‘X’. The quality feature is an integer; all other features are numeric. Other observations:

  • Most wines have a quality of 5 or 6; the lowest quality score is 3 and the highest is 8
  • The median of fixed acidity is 7.90 and the mean is 8.32
  • For 75% of the observations, the volatile acidity is at most 0.64
  • The minimum citric acid is 0, the maximum is 1, and the median is 0.26
  • For 75% of the observations, the residual sugar is at most 2.6
  • Most of the chlorides values are between 0.04 and 0.12
  • The median of free sulfur dioxide is 14
  • Total sulfur dioxide is right-skewed: 75% of the observations are at or below 62, while the maximum is 289
  • The median of density is 0.9968
  • The pH values are between 2.74 and 4.01, so all the wines in the dataset are acidic
  • The median of sulphates is 0.62
  • The median of alcohol is 10.20

What is/are the main feature(s) of interest in your dataset?

volatile acidity/citric acid/free sulfur dioxide

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

total sulfur dioxide

Did you create any new variables from existing variables in the dataset?

no

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, some features have long-tailed distributions, such as residual sugar, chlorides, and sulphates.

I log-transformed those features when plotting them to get a better view of the distribution of the data.

Bivariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Fixed acidity does not seem to have a relationship with quality.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$fixed.acidity and pf$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

The correlation coefficient is only 0.12. The correlation is statistically significant (p < 0.001) but very weak, which is consistent with the observation above that fixed acidity has little relationship with quality.
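The t statistic and p-value in the output above follow directly from r and the sample size. The sketch below recomputes them from the reported r = 0.1240516 and df = 1597, as a way of reading the `cor.test` output; it does not need the data itself.

```r
r  <- 0.1240516   # sample correlation reported above
df <- 1597        # n - 2, with n = 1599 wines

# Test statistic for H0: true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# Share of the variance in quality associated with fixed acidity.
r_squared <- r^2

print(round(t_stat, 3))
print(signif(p_value, 4))
print(round(r_squared, 4))
```

Here r² is about 0.015, so fixed acidity is associated with roughly 1.5% of the variance in quality. With 1599 observations, even a correlation this small produces a very small p-value, so significance alone says little about the strength of the relationship.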

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity seems to have a negative correlation with quality. We compute the correlation coefficient below to check.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$volatile.acidity and pf$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

The correlation coefficient is -0.39, so there is a weak-to-moderate negative correlation between volatile acidity and quality.

From the scatterplot, citric acid and quality do not appear to be correlated.
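The 95 percent confidence interval for the volatile acidity correlation can be reproduced from r and n with the Fisher z-transformation, which is the method R's `cor.test` uses for Pearson correlations. A short sketch using only the reported values:

```r
r <- -0.3905578   # correlation reported above
n <- 1599         # number of wines

z    <- atanh(r)           # Fisher z-transform of r
se   <- 1 / sqrt(n - 3)    # approximate standard error of z
crit <- qnorm(0.975)       # two-sided 95% critical value

# Interval on the z scale, transformed back to the correlation scale.
ci <- tanh(z + c(-1, 1) * crit * se)
print(round(ci, 4))
```

The interval excludes 0 by a wide margin, so the negative association is well established even though its size is modest.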

## 
##  Pearson's product-moment correlation
## 
## data:  pf$citric.acid and pf$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

The correlation coefficient is 0.23, a weak positive correlation, which is broadly consistent with the scatterplot showing no clear relationship.

There is no obvious relationship between residual.sugar and quality observed.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$residual.sugar and pf$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

The correlation coefficient is only 0.01, there is nearly no relationship between residual sugar and wine quality

For wines of higher quality (5-8), the quality score drops as chlorides rise, but this does not apply to wines of quality 3 and 4.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$chlorides and pf$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

The correlation coefficient is -0.13; that is to say, there is hardly any correlation between chlorides and quality.

Again, there is no correlation between free sulfur dioxide and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$free.sulfur.dioxide and pf$quality
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606

## 
##  Pearson's product-moment correlation
## 
## data:  pf$total.sulfur.dioxide and pf$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

No strong correlation again: total sulfur dioxide has only a weak negative correlation with quality (about -0.19).

## 
##  Pearson's product-moment correlation
## 
## data:  pf$density and pf$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

No strong correlation again.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$pH and pf$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

No correlation again.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

## 
##  Pearson's product-moment correlation
## 
## data:  pf$sulphates and pf$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

No strong correlation again.

Except for a few outliers, the quality score goes higher as alcohol gets higher, so the quality of wine should be correlated with alcohol. We will get rid of those outliers and draw a better plot below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Though the points are much denser between 9 and 12, we can still see the trend that as alcohol gets higher, quality gets higher too. We will check the correlation coefficient below.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$alcohol and pf$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

The correlation coefficient is 0.48, so alcohol and quality are moderately correlated.

Based on the scatterplot matrix, alcohol and volatile acidity are the two factors most correlated with the quality of red wine, with correlation coefficients of 0.48 and -0.39 respectively. This is consistent with our observations above.
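The variable-by-variable comparison above can also be done in one pass by computing every predictor's correlation with quality and sorting by absolute value. The sketch below is only an illustration: `rank_correlations` is a hypothetical helper, not a function from the original analysis, and the six rows are the first rows of the dataset as printed in this report, used purely to make the example runnable.

```r
# First six rows of three predictors plus quality, copied from this report
pf_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Hypothetical helper: Pearson correlation of each numeric column with the
# target, strongest (by absolute value) first. "X" is the row-index column.
rank_correlations <- function(df, target = "quality", drop = "X") {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  predictors <- setdiff(numeric_cols, c(target, drop))
  r <- sapply(predictors, function(v) cor(df[[v]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

ranked <- rank_correlations(pf_head)
print(round(ranked, 2))
```

On the full data frame the call would be `rank_correlations(pf)`. Six rows are far too few to say anything about wine, so the numbers printed by this snippet should not be compared with the coefficients reported above.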

Also, considering that the values of quality are all integers, we can create a new factor variable for quality and draw some box plots to see whether we can find something interesting.

From the boxplot we can see that although the bad quality wines (quality levels 3 and 4) have an average alcohol of about 10, more than that of quality level 5, the good quality wines (levels 7 and 8) have an average alcohol of more than 10.5. The average alcohol of quality level 8 is even about 11.5.

We can see that as the quality level goes higher, volatile.acidity drops.
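The factor conversion and the per-level comparison described above can be sketched as follows. The data frame `toy` is made up for illustration and is not taken from the wine dataset; only the column name `quality_fac` follows the one used in this report.

```r
# Toy data, made up for illustration only (not from the wine dataset)
toy <- data.frame(
  quality = c(3, 3, 5, 5, 5, 7, 7, 8),
  alcohol = c(9.8, 10.2, 9.4, 9.6, 9.9, 11.0, 11.6, 12.1)
)

# Treat the integer score as a categorical variable, levels sorted numerically
toy$quality_fac <- factor(toy$quality, levels = sort(unique(toy$quality)))

# Median alcohol per quality level (the centre line of each box)
median_by_quality <- tapply(toy$alcohol, toy$quality_fac, median)
print(median_by_quality)

# Five-number summaries behind the boxplot; drop plot = FALSE to draw it
box_stats <- boxplot(alcohol ~ quality_fac, data = toy, plot = FALSE)
print(box_stats$names)
```

Note that a boxplot draws the median rather than the mean, so when the text speaks of the "average" alcohol per level, the value being read off the plot is the median of that group.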

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

My features of interest were originally volatile acidity, citric acid, and free sulfur dioxide, since the description in the txt file led me to think so. But after plotting the scatterplots and checking the correlation coefficient between each feature and wine quality, only volatile acidity turned out to have a weak correlation with wine quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, alcohol is more strongly correlated with wine quality than the other features are.

What was the strongest relationship you found?

Alcohol seems to be the factor most strongly correlated with the quality of red wine.

Multivariate Plots Section

Per the observations above, alcohol and volatile acidity are the two factors most correlated with the quality of red wine, with correlation coefficients of 0.48 and -0.39. The next two variables most correlated with quality are sulphates and citric acid, with correlation coefficients of 0.25 and 0.23. We will add those variables to the scatterplots of alcohol vs. quality and volatile acidity vs. quality to see whether we can find something interesting. First we need to convert the numeric variables sulphates and citric acid to factors with the cut function.
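A minimal sketch of the bucketing step with `cut` is given here. The bucket labels are the ones that appear in this report's output, but the break points are an assumption: the report does not show the breaks actually used, so the sketch takes the quartiles that `summary()` reports for sulphates. It also shows why the minimum value needs special handling.

```r
# Quartiles of sulphates from summary(); using them as breaks is an
# assumption, the report does not show the breaks actually used
breaks <- c(0.33, 0.55, 0.62, 0.73, 2.00)
labels <- c("Sulph_Low", "Sulph_Middle", "Sulph_High", "Sulph_VeryHigh")

# The minimum, the first four sulphates values in the data, and the maximum
sulphates <- c(0.33, 0.56, 0.68, 0.65, 0.58, 2.00)

# Default intervals are (lo, hi], so the minimum itself falls outside -> NA
naive <- cut(sulphates, breaks = breaks, labels = labels)

# include.lowest = TRUE closes the first interval on the left: [lo, hi]
fixed <- cut(sulphates, breaks = breaks, labels = labels,
             include.lowest = TRUE)

print(data.frame(sulphates, naive, fixed))
```

Setting `include.lowest = TRUE` has the same effect as starting the first interval at a value slightly below the minimum, without having to pick such a value by hand.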

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality quality_fac sulphates_bucket
## 1       5           5     Sulph_Middle
## 2       5           5       Sulph_High
## 3       5           5       Sulph_High
## 4       6           6     Sulph_Middle
## 5       5           5     Sulph_Middle
## 6       5           5     Sulph_Middle
## 
##      Sulph_Low   Sulph_Middle     Sulph_High Sulph_VeryHigh 
##            420            409            384            386
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality quality_fac sulphates_bucket citric.acid_bucket
## 1       5           5     Sulph_Middle         Citric_Low
## 2       5           5       Sulph_High         Citric_Low
## 3       5           5       Sulph_High         Citric_Low
## 4       6           6     Sulph_Middle    Citric_VeryHigh
## 5       5           5     Sulph_Middle         Citric_Low
## 6       5           5     Sulph_Middle         Citric_Low
## 
##      Citric_Low   Citric_Middle     Citric_High Citric_VeryHigh 
##             403             449             349             398

Then we will draw the scatterplots of alcohol vs. quality and volatile acidity vs. quality, colored by sulphates_bucket and citric.acid_bucket.

Though it is not that obvious, the overall trend is that the color changes from Low to High as the quality gets higher.

The scatterplot of quality vs. alcohol colored by citric.acid_bucket looks similar to the one colored by sulphates_bucket. Though it is not that obvious, the color does change as the quality changes.

As the quality goes down, there is a color change.

Same as above.

In the scatterplot of quality vs. alcohol colored by sulphates_bucket and faceted by citric.acid_bucket, we can roughly see in the ‘Citric_High’ and ‘Citric_VeryHigh’ panels that as alcohol and sulphates go higher, the quality score gets higher.

This is the scatterplot of quality vs. alcohol colored by citric.acid_bucket and faceted by sulphates_bucket. Though there are some exceptions, the trend is that when sulphates and citric acid are higher, the quality gets better as alcohol goes higher.

We can see that there are more good quality wines when sulphates are high, citric acid is high, and volatile acidity is low.

Just about the same as above.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Good quality wine tends to have higher alcohol, lower volatile acidity, higher sulphates, and higher citric acid.

Were there any interesting or surprising interactions between features?

Not really. The findings are more or less the same as in the Bivariate Analysis.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No. As we have seen, alcohol and volatile acidity are not strongly correlated with quality, so a linear model would not help much here.


Final Plots and Summary

Plot One

Description One

The histogram shows the distribution of the wine quality.
From the plot we can tell that most of the wines have a quality of 5 or 6.
Only a few have a quality of 3 or 8, which is very low or very high.

Plot Two

Description Two

We can see from the scatterplot that as alcohol goes higher, the quality of wine also goes higher. There is a positive correlation between alcohol and wine quality, although the correlation is not that strong.

Plot Three

Description Three

As shown in the scatterplot above, as alcohol, sulphates, and citric acid go higher, the quality score of wine tends to go higher, too.


Reflection

  1. First of all, the working directory is so annoying. The setwd function only applies to the chunk it is called in, so you have to write code at the very beginning of the rmd to set the working directory. But after I did that and set the working directory to the folder where I keep all my datasets, every new file I created showed up there, which was annoying too. So I gave up and made a copy of the dataset in the same folder as the rmd file. That left me in peace for a while, until I found something weird: when you create a new file, sometimes it shows up in folder A and sometimes in folder B, which is really frustrating. I mean, how much interest would you have left for the EDA adventure if you keep being bothered by annoying things like this?

  2. Secondly, I would say I was trapped by some misunderstandings of EDA. I know exactly what EDA means, but whenever I created a plot, a voice kept telling me that there should be something more, something valuable, in there, and that I should look more deeply to figure it out. I am afraid this exhausted me somewhat, and my interest and passion for playing with data faded little by little.

  3. Though I thought I had learned a lot through the videos and quizzes, when it was my turn to do my own analysis, basic functions aside, it was a little difficult to find the right function to use. Even after reviewing the videos again, I still had this kind of issue from time to time. Hopefully it will get better once I have done more exercises or projects.

  4. When plotting the scatterplots of the features against quality and finding that almost none of the features has a strong correlation with quality, I felt I did not know what I was doing or what I should do next, so I just dug into the volatile acidity and alcohol features. But I always had the concern that those would not work, that they would not be convincing enough to predict wine quality.

  5. When using the cut function to convert numeric variables to factors, the left endpoint of each interval is not included by default. So for the first interval we need to use a value less than the minimum (or set include.lowest = TRUE), or the minimum value will become NA.

  6. TO DO: A model was not created during this analysis. Since the EDA so far did not show a strong relationship between quality and the other variables, a linear regression model would not predict well. In the future we could think about creating another kind of model (logistic regression, etc.) to predict red wine quality. Also, the dataset we are using has only 1599 observations, which is rather small; for future analysis it would be great if a larger dataset were available.